Biostat 203B Homework 2

Due Feb 9 @ 11:59PM

Author

Brilla Meng, UID:806329681

Display machine information for reproducibility:

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.1    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.1       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.34     rlang_1.1.3       evaluate_0.23    

Load necessary libraries (you can add more as needed).

library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)

Display memory information of your computer

memuse::Sys.meminfo()
Totalram:   16.000 GiB 
Freeram:   188.922 MiB 

In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.

Display the contents of MIMIC hosp and icu data folders:

ls -l ~/mimic/hosp/
total 3817600
-rw-r--r--@ 1 brilla  staff    15516088 Jan  5  2023 admissions.csv.gz
-rw-r--r--@ 1 brilla  staff      427468 Jan  5  2023 d_hcpcs.csv.gz
-rw-r--r--@ 1 brilla  staff      859438 Jan  5  2023 d_icd_diagnoses.csv.gz
-rw-r--r--@ 1 brilla  staff      578517 Jan  5  2023 d_icd_procedures.csv.gz
-rw-r--r--@ 1 brilla  staff       12900 Jan  5  2023 d_labitems.csv.gz
-rw-r--r--@ 1 brilla  staff    25070720 Jan  5  2023 diagnoses_icd.csv.gz
-rw-r--r--@ 1 brilla  staff     7426955 Jan  5  2023 drgcodes.csv.gz
-rw-r--r--@ 1 brilla  staff   508524623 Jan  5  2023 emar.csv.gz
-rw-r--r--@ 1 brilla  staff   471096030 Jan  5  2023 emar_detail.csv.gz
-rw-r--r--@ 1 brilla  staff     1767138 Jan  5  2023 hcpcsevents.csv.gz
-rw-r--r--@ 1 brilla  staff  1939088924 Jan  5  2023 labevents.csv.gz
-rw-r--r--@ 1 brilla  staff    96698496 Jan  5  2023 microbiologyevents.csv.gz
-rw-r--r--@ 1 brilla  staff     2312631 Jan  5  2023 patients.csv.gz
-rw-r--r--@ 1 brilla  staff   398753125 Jan  5  2023 pharmacy.csv.gz
-rw-r--r--@ 1 brilla  staff   498505135 Jan  5  2023 poe.csv.gz
-rw-r--r--@ 1 brilla  staff    25477219 Jan  5  2023 poe_detail.csv.gz
-rw-r--r--@ 1 brilla  staff   458817415 Jan  5  2023 prescriptions.csv.gz
-rw-r--r--@ 1 brilla  staff     6027067 Jan  5  2023 procedures_icd.csv.gz
-rw-r--r--@ 1 brilla  staff      122507 Jan  5  2023 provider.csv.gz
-rw-r--r--@ 1 brilla  staff     6781247 Jan  5  2023 services.csv.gz
-rw-r--r--@ 1 brilla  staff    36158338 Jan  5  2023 transfers.csv.gz
ls -l ~/mimic/icu/
total 4819968
-rw-r--r--@ 1 brilla  staff       35893 Jan  5  2023 caregiver.csv.gz
-rw-r--r--@ 1 brilla  staff  2467761053 Jan  5  2023 chartevents.csv.gz
-rw-r--r--@ 1 brilla  staff       57476 Jan  5  2023 d_items.csv.gz
-rw-r--r--@ 1 brilla  staff    45721062 Jan  5  2023 datetimeevents.csv.gz
-rw-r--r--@ 1 brilla  staff     2614571 Jan  5  2023 icustays.csv.gz
-rw-r--r--@ 1 brilla  staff   251962313 Jan  5  2023 ingredientevents.csv.gz
-rw-r--r--@ 1 brilla  staff   324218488 Jan  5  2023 inputevents.csv.gz
-rw-r--r--@ 1 brilla  staff    38747895 Jan  5  2023 outputevents.csv.gz
-rw-r--r--@ 1 brilla  staff    20717852 Jan  5  2023 procedureevents.csv.gz

Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

Q1.1 Speed, memory, and data types

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.

Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage.)

answer:

#read.csv(("~/mimic/hosp/admissions.csv.gz"), header = TRUE)
#read_csv("~/mimic/hosp/admissions.csv.gz")
#fread("~/mimic/hosp/admissions.csv.gz")
system.time(read.csv("~/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  3.470   0.061   3.578 
system.time(read_csv("~/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  0.762   0.120   0.603 
system.time(fread("~/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  0.299   0.028   0.352 
pryr::object_size(read.csv("~/mimic/hosp/admissions.csv.gz"))
158.71 MB
pryr::object_size(read_csv("~/mimic/hosp/admissions.csv.gz"))
55.31 MB
pryr::object_size(fread("~/mimic/hosp/admissions.csv.gz"))
50.13 MB

answer: fread() is the fastest, taking 0.046 seconds and uses the least memory at 50.13 MB. The other two functions, read.csv() and read_csv(), are slower and consume more memory, with no differences in default data types parsed.

Q1.2 User-supplied data types

Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)

read_csv("~/mimic/hosp/admissions.csv.gz", col_types = cols())
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime          
        <dbl>    <dbl> <dttm>              <dttm>             
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00
# ℹ 431,221 more rows
# ℹ 12 more variables: deathtime <dttm>, admission_type <chr>,
#   admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
#   hospital_expire_flag <dbl>
time_taken <- system.time(
  admissions <- read_csv("~/mimic/hosp/admissions.csv.gz", col_types = cols())
)
memory_used <- pryr::object_size(admissions)
print(time_taken)
   user  system elapsed 
  0.781   0.099   0.488 
print(memory_used)
55.31 MB

answer: The system run time is half from the original version. The memory used is become 1/3 of original version.

Q2. Ingest big data files

Let us focus on a bigger file, labevents.csv.gz, which is about 125x bigger than admissions.csv.gz.

ls -l ~/mimic/hosp/labevents.csv.gz
-rw-r--r--@ 1 brilla  staff  1939088924 Jan  5  2023 /Users/brilla/mimic/hosp/labevents.csv.gz

Display the first 10 lines of this file.

zcat < ~/mimic/hosp/labevents.csv.gz | head -10
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

Q2.1 Ingest labevents.csv.gz by read_csv

Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 5 minutes on your computer, then abort the program and report your findings.

read_csv("~/mimic/hosp/labevents.csv.gz")

answer: It spends more than 5 minutes to read the file. It is not a good idea to use read_csv to read this file.

Q2.2 Ingest selected columns of labevents.csv.gz by read_csv

Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)

read_csv("~/mimic/hosp/labevents.csv.gz", col_select = c("subject_id", "itemid", "charttime", "valuenum"))
# A tibble: 118,171,367 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10000032  51237 2180-03-23 11:51:00      1.4
 2   10000032  51274 2180-03-23 11:51:00     15.1
 3   10000032  50853 2180-03-23 11:51:00     15  
 4   10000032  50861 2180-03-23 11:51:00    102  
 5   10000032  50862 2180-03-23 11:51:00      3.3
 6   10000032  50863 2180-03-23 11:51:00    109  
 7   10000032  50864 2180-03-23 11:51:00      8  
 8   10000032  50868 2180-03-23 11:51:00     12  
 9   10000032  50878 2180-03-23 11:51:00    143  
10   10000032  50882 2180-03-23 11:51:00     27  
# ℹ 118,171,357 more rows

answer: yes, it spends less than than original one.

Q2.3 Ingest subset of labevents.csv.gz

Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.

In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. To save render time, put #| eval: false at the beginning of this code chunk.)

zcat < ~/mimic/hosp/labevents.csv.gz | awk -F',' 'BEGIN {OFS=","} NR==1 || $5=="50912" || $5=="50912" || $5=="50971" || $5=="50983" || $5=="50902" || $5=="50882" || $5=="51221" || $5=="51301" {print $2, $5, $7, $10}' | gzip > labevents_filtered.csv.gz

Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file? How long does it take read_csv to ingest labevents_filtered.csv.gz?

zcat < labevents_filtered.csv.gz | head -10
zcat < labevents_filtered.csv.gz | wc -l
subject_id,itemid,charttime,valuenum
10000032,50882,2180-03-23 11:51:00,27
10000032,50902,2180-03-23 11:51:00,101
10000032,50912,2180-03-23 11:51:00,0.4
10000032,50971,2180-03-23 11:51:00,3.7
10000032,50983,2180-03-23 11:51:00,136
10000032,51221,2180-03-23 11:51:00,45.4
10000032,51301,2180-03-23 11:51:00,3
10000032,51221,2180-05-06 22:25:00,42.6
10000032,51301,2180-05-06 22:25:00,5
zcat: (stdin): unexpected end of file
 2889731
system.time(read_csv("labevents_filtered.csv.gz"))
   user  system elapsed 
  1.323   0.163   0.590 

answer: The first 10 lines of the new file labevents_filtered.csv.gz are displayed. There are 2889731 lines in this new file. The system run time is 0.204 seconds.

Q2.4 Ingest labevents.csv by Apache Arrow

Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory. To save render time, put #| eval: false at the beginning of this code chunk.

gzip -d -c  ~/mimic/hosp/labevents.csv.gz > labevents.csv

Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

labevents <- arrow::open_dataset("labevents.csv", format = "csv") %>% 
dplyr::select(subject_id, itemid, charttime, valuenum) %>% 
dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
collect()
time_taken <- system.time({
  labevents <- arrow::open_dataset("labevents.csv", format = "csv") %>%
    dplyr::select(subject_id, itemid, charttime, valuenum) %>%
    dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))%>%
collect()
})
print(time_taken)
   user  system elapsed 
 31.548   1.700  29.628 
nrow(labevents)
[1] 24855909
head(labevents, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 04:51:00     27  
 2   10000032  50902 2180-03-23 04:51:00    101  
 3   10000032  50912 2180-03-23 04:51:00      0.4
 4   10000032  50971 2180-03-23 04:51:00      3.7
 5   10000032  50983 2180-03-23 04:51:00    136  
 6   10000032  50931 2180-03-23 04:51:00     95  
 7   10000032  51221 2180-03-23 04:51:00     45.4
 8   10000032  51301 2180-03-23 04:51:00      3  
 9   10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5  

answer: It takes 1.754 seconds for the system.time to ingest labevents.csv by Apache Arrow. There are 24855909 rows in the result tibble. The first 10 rows of the result tibble match those in Q2.3.

Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator. answer: Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides libraries for manipulating and analyzing in-memory data, and it is designed to be used in a wide range of applications, from small scripts to large-scale distributed systems.

Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter

Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

labevents %>%
  arrow::write_dataset(format = "parquet", path = "labevents_filtered.parquet")
  arrow::open_dataset("labevents_filtered.parquet",format ="parquet") %>%
  dplyr::select(subject_id, itemid, charttime, valuenum) %>%
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
  collect()
# A tibble: 24,855,909 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 04:51:00     27  
 2   10000032  50902 2180-03-23 04:51:00    101  
 3   10000032  50912 2180-03-23 04:51:00      0.4
 4   10000032  50971 2180-03-23 04:51:00      3.7
 5   10000032  50983 2180-03-23 04:51:00    136  
 6   10000032  50931 2180-03-23 04:51:00     95  
 7   10000032  51221 2180-03-23 04:51:00     45.4
 8   10000032  51301 2180-03-23 04:51:00      3  
 9   10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5  
# ℹ 24,855,899 more rows
  file.size("labevents_filtered.parquet")
[1] 96
time_taken <- system.time({
    labevents <- arrow::open_dataset("labevents_filtered.parquet", format = "parquet") %>%
      dplyr::select(subject_id, itemid, charttime, valuenum) %>%
      dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
      collect()
  })
  print(time_taken)
   user  system elapsed 
  0.554   0.083   0.199 
  nrow(labevents)
[1] 24855909
  head(labevents, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 04:51:00     27  
 2   10000032  50902 2180-03-23 04:51:00    101  
 3   10000032  50912 2180-03-23 04:51:00      0.4
 4   10000032  50971 2180-03-23 04:51:00      3.7
 5   10000032  50983 2180-03-23 04:51:00    136  
 6   10000032  50931 2180-03-23 04:51:00     95  
 7   10000032  51221 2180-03-23 04:51:00     45.4
 8   10000032  51301 2180-03-23 04:51:00      3  
 9   10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5  

answer: The Parquet file labevents_filtered.parquet is 96. It takes 0.295 seconds for the system.time to ingest labevents_filtered.parquet by Apache Arrow. There are 24855909 rows in the result tibble. The first 10 rows of the result tibble match those in Q2.3.

Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator. answer: Parquet is a columnar storage file format that is optimized for reading and writing large datasets. It is designed to be efficient for both compression and encoding, and it is especially good for complex data types. It is also designed to be used in a wide range of applications, from small scripts to large-scale distributed systems.

Q2.6 DuckDB

Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

labevents_duckdb <- arrow::open_dataset("labevents_filtered.parquet", format = "parquet") %>%
    arrow::to_duckdb() %>% # Ensure to reference arrow explicitly
    dplyr::select(subject_id, itemid, charttime, valuenum) %>%
    dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
    dplyr::arrange(subject_id, charttime) %>%
    collect()
time_taken <- system.time({
    labevents_duckdb <- arrow::open_dataset("labevents_filtered.parquet", format = "parquet") %>%
    arrow::to_duckdb() %>% # Ensure to reference arrow explicitly
    dplyr::select(subject_id, itemid, charttime, valuenum) %>%
    dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
    dplyr::arrange(subject_id, charttime) %>%
    collect()
  })
print(time_taken)
   user  system elapsed 
  4.316   1.771   2.240 
nrow(labevents_duckdb)
[1] 24855909
head(labevents_duckdb, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 11:51:00     27  
 2   10000032  50902 2180-03-23 11:51:00    101  
 3   10000032  50912 2180-03-23 11:51:00      0.4
 4   10000032  50971 2180-03-23 11:51:00      3.7
 5   10000032  50983 2180-03-23 11:51:00    136  
 6   10000032  50931 2180-03-23 11:51:00     95  
 7   10000032  51221 2180-03-23 11:51:00     45.4
 8   10000032  51301 2180-03-23 11:51:00      3  
 9   10000032  51221 2180-05-06 22:25:00     42.6
10   10000032  51301 2180-05-06 22:25:00      5  

answer: It takes 1.686 seconds for the system.time to ingest labevents_filtered.parquet by Apache Arrow. There are 24855909 rows in the result tibble. The first 10 rows of the result tibble match those in Q2.3.

Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator. answer: DuckDB is an embeddable SQL OLAP database management system. It is designed to be used as an embedded database, which means it can be easily integrated into other software applications. It is optimized for analytical queries, and it is designed to be fast and efficient for complex queries on large datasets. It is also designed to be easy to use, with a simple and intuitive SQL interface.

Q3. Ingest and filter chartevents.csv.gz

chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head -10
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head -10
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.

Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.

gzip -d -c  ~/mimic/icu/chartevents.csv.gz > chartevents.csv
chartevents <- arrow::open_dataset("chartevents.csv", format = "csv") %>% 
dplyr::select(subject_id, itemid, charttime, valuenum) %>% 
dplyr::filter(itemid %in% c(220045, 220181, 220179, 223761, 220210)) %>%
collect()
nrow(chartevents)
[1] 22502319
head(chartevents, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032 220179 2180-07-23 14:01:00       82
 2   10000032 220181 2180-07-23 14:01:00       63
 3   10000032 220045 2180-07-23 15:00:00       94
 4   10000032 220179 2180-07-23 15:00:00       85
 5   10000032 220181 2180-07-23 15:00:00       62
 6   10000032 220210 2180-07-23 15:00:00       20
 7   10000032 220045 2180-07-23 12:00:00       97
 8   10000032 220179 2180-07-23 12:00:00       93
 9   10000032 220181 2180-07-23 12:00:00       56
10   10000032 220210 2180-07-23 12:00:00       16

answer: There are 22502319 rows in the result tibble. The first 10 rows of the result tibble are shown.